08. RNN Hyperparameters
LSTM vs. GRU
"These results clearly indicate the advantages of the gating units over the more traditional recurrent
units. Convergence is often faster, and the final solutions tend to be better. However, our results are
not conclusive in comparing the LSTM and the GRU, which suggests that the choice of the type of
gated recurrent unit may depend heavily on the dataset and corresponding task."
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio
"The GRU outperformed the LSTM on all tasks with the exception of language modelling"
An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever
"Our consistent finding is that depth of at least two is
beneficial. However, between two and three layers our results are mixed. Additionally, the results
are mixed between the LSTM and the GRU, but both significantly outperform the RNN."
Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei
"Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks."
Understanding LSTM Networks by Chris Olah
"In our [Neural Machine Translation] experiments, LSTM cells consistently outperformed GRU cells. Since the computational bottleneck in our architecture is the softmax operation we did not observe large difference in training speed between LSTM and GRU cells. Somewhat to our surprise, we found that the vanilla decoder is unable to learn nearly as well as the gated variant."
Massive Exploration of Neural Machine Translation Architectures by Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le
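Taken together, these results suggest there is no universally better gated cell: whether the LSTM or the GRU wins depends on the dataset and task, so it is usually worth trying both. The sketch below is a minimal PyTorch example (not taken from any of the papers above; the function name `build_encoder` and the sizes are illustrative assumptions) showing that switching between the two cell types is a one-argument change, which makes that comparison cheap to run.

```python
# Minimal sketch: swapping LSTM for GRU is a one-argument change.
# `build_encoder` and the sizes below are illustrative, not from the cited papers.
import torch
import torch.nn as nn

def build_encoder(cell_type="lstm", input_size=128, hidden_size=256, num_layers=2):
    """Return a recurrent encoder using either an LSTM or a GRU."""
    rnn_cls = nn.LSTM if cell_type == "lstm" else nn.GRU
    return rnn_cls(input_size=input_size,
                   hidden_size=hidden_size,
                   num_layers=num_layers,
                   batch_first=True)

# Train both variants on the same data and compare validation performance,
# since the papers above suggest the better choice is task-dependent.
lstm_encoder = build_encoder("lstm")
gru_encoder = build_encoder("gru")

x = torch.randn(4, 10, 128)      # (batch, time, features)
lstm_out, _ = lstm_encoder(x)    # output shape: (4, 10, 256)
gru_out, _ = gru_encoder(x)      # same output shape for either cell
print(lstm_out.shape, gru_out.shape)
```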
Example RNN Architectures
Application | Cell | Layers | Size | Vocabulary | Embedding Size | Learning Rate | Paper
---|---|---|---|---|---|---|---
Speech Recognition (large vocabulary) | LSTM | 5, 7 | 600, 1000 | 82K, 500K | -- | -- | paper
Speech Recognition | LSTM | 1, 3, 5 | 250 | -- | -- | 0.001 | paper
Machine Translation (seq2seq) | LSTM | 4 | 1000 | Source: 160K, Target: 80K | 1,000 | -- | paper
Image Captioning | LSTM | -- | 512 | -- | 512 | (fixed) | paper
Image Generation | LSTM | -- | 256, 400, 800 | -- | -- | -- | paper
Question Answering | LSTM | 2 | 500 | -- | 300 | -- | |
Text Summarization | GRU | -- | 200 | Source: 119K, Target: 68K | 100 | 0.001 | |
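To make the table concrete, here is a hypothetical sketch (in PyTorch, not code from the linked papers) of how the machine-translation encoder row might map onto model hyperparameters: an LSTM with 4 layers, hidden size 1000, a source vocabulary of 160K, and an embedding size of 1,000. The class name `Seq2SeqEncoder` is illustrative.

```python
# Hypothetical encoder matching the seq2seq row of the table above:
# LSTM cell, 4 layers, size 1000, source vocabulary 160K, embedding size 1,000.
import torch
import torch.nn as nn

class Seq2SeqEncoder(nn.Module):
    def __init__(self, vocab_size=160_000, embedding_size=1000,
                 hidden_size=1000, num_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.LSTM(embedding_size, hidden_size,
                           num_layers=num_layers, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, time) integer ids from the source vocabulary
        embedded = self.embedding(tokens)
        outputs, (hidden, cell) = self.rnn(embedded)
        return outputs, hidden

encoder = Seq2SeqEncoder()
src = torch.randint(0, 160_000, (2, 20))   # batch of 2 sentences, 20 tokens each
outputs, hidden = encoder(src)
print(outputs.shape)   # (2, 20, 1000)
```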